By Shraddha Jhingan, Sofia Maysenhalder, Tammie Tam
March 14, 2022
In 2020, the number of reported hate crimes in the United States rose 6 percent from 2019, according to the FBI (Hernandez 2021). About 62 percent of reported hate crimes were racially motivated (FBI). In a study examining the role of social media in the frequency of hate crime in Germany, researchers found that increased anti-refugee rhetoric and sentiment on social media was correlated with an increased frequency of anti-refugee hate crimes (Müller and Schwarz 2021). We also wondered how everyday people have responded to such hate crime incidents on social media. Therefore, we are interested in understanding the biases that motivate hate crimes and how people fight against them on social media in the United States.
Using datasets on hate crime cases from the FBI's Crime Data Explorer page and the Sacramento Police Department, we hope to answer the following questions at the National (U.S.), State (California), and City (Sacramento) levels:
Using the Reddit API, we hope to answer the following questions at the State (California) and City (Sacramento) level:
By answering these questions using data analysis and visualization methods, we hope to understand ways hate crimes can be prevented and where we can target hate-crime-prevention policies.
To obtain data about hate crimes in the United States and California, we are using the Federal Bureau of Investigation’s 1991-2020 Hate Crime Dataset (hc), which is downloadable from their Crime Data Explorer webpage (FBI Crime Data Explorer). The dataset contains pertinent information on reported hate crimes across the U.S. from 1991 to 2020, such as the year, the motivation ("bias_desc"), and the location ("state_abbr", "pub_agency_name"). Through this dataset, we want to explore which types of biases are the strongest driving factors in hate crime and how the nature of these hate crimes varies across the United States.
The dataset is largely complete, but has a few formatting issues. The column names used inconsistent letter casing, so we converted them all to lowercase and renamed some columns to simpler names, such as renaming “data_year” to “year”. Missing values and unnecessary columns were removed from the dataset. One oddity is that Nebraska was abbreviated “NB” when its official abbreviation is “NE”; this was a problem when generating maps that rely on the official abbreviation, so we replaced “NB” with “NE” in the “state_abbr” column. Since there were so many different specific biases, each bias was assigned a bias category for a more standardized analysis. Some reported hate crime incidents listed more than one bias; these were removed because they were few in number and would clutter subsequent visualizations.
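A minimal pandas sketch of these cleaning steps, using toy rows in place of the full FBI file (the uppercase column names and the semicolon convention for multi-bias incidents are assumptions about the raw file):

```python
import pandas as pd

# Toy rows standing in for the FBI dataset (the real file is read with pd.read_csv)
hc = pd.DataFrame({
    "DATA_YEAR": [2019, 2020, 2020],
    "STATE_ABBR": ["NB", "CA", "CA"],
    "BIAS_DESC": ["Anti-Black or African American",
                  "Anti-Asian",
                  "Anti-Asian;Anti-White"],  # multi-bias incident
})

# Standardize column names to lowercase and simplify them
hc.columns = hc.columns.str.lower()
hc = hc.rename(columns={"data_year": "year"})

# Fix the nonstandard Nebraska abbreviation ("NB" -> "NE")
hc["state_abbr"] = hc["state_abbr"].replace("NB", "NE")

# Drop incidents reporting more than one bias (assumed to be
# semicolon-separated in "bias_desc")
hc = hc[~hc["bias_desc"].str.contains(";")]
```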
To analyze hate crime in the United States, I attempted to answer the following questions:
In the United States, the number of hate crimes increased from 1991 to 2001 and then steadily decreased until 2015. From 2015 on, the number of hate crimes increased dramatically, reaching an all-time high in 2020.

The dramatic increase in the number of reported hate crimes is due to an increase in the number of racially motivated hate crimes. Other motivators of hate crime, such as a person’s sex, sexuality, religion, and disability, have remained relatively stable across the three decades.
| Total Number of Hate Crimes By Motivation in the U.S. Between 1991-2020 | Change in Total Number of Hate Crimes By Motivation in the U.S. Between 1991-2020 |
|---|---|
| ![]() | ![]() |
Since race/ethnicity is a significant motivator of hate crime and the main driver of the increase in the number of hate crimes, it is important to understand which race/ethnicity biases are most prevalent. The top five race/ethnicity biases in hate crimes, from most to least common, are Anti-Black, Anti-White, Anti-Hispanic or Latino, Anti-Other Race/Ethnicity/Ancestry, and Anti-Asian. Of the five, anti-Black bias dominates significantly and is the main driver of racially/ethnically motivated hate crimes from 1991 to 2020.
| Total Number of Hate Crimes By Racial Motivation in the U.S. Between 1991-2020 | Change in Number of Hate Crimes By Different Racial Motivation in the U.S. Between 1991-2020 |
|---|---|
| ![]() | ![]() |
From 2019 to 2020, hate crimes against all racial groups except Arabs increased. The top five races/ethnicities that experienced an uptick in hate crimes are those of Multiple Races, White, Asian, Black or African American, and Eastern Orthodox. According to the BBC, the rise in hate crimes against Asians is due to the misinformation and negative attention brought about by the COVID-19 pandemic, since COVID-19 originated in China (Cabral 2021). Since the Black Lives Matter movement began in 2013, more attention, both good and bad, has been placed on Black people. According to Dr. Emmitt Y. Riley III, this has led to racial tension and resentment from White people, which may contribute to the rise in hate crimes against Black people (UC Press 2021). The rise in anti-White hate crimes is surprising at first glance, but according to the Daily Beast, white nationalist groups are spinning stories of hate crimes against White people, which may have led to over-reporting of anti-White hate crimes (Hay 2021).
Since 1991, hate crimes due to racial biases have occurred in every state, but have occurred the most in California, the most populous state (World Population Review).
Click for interactive map: Total Number of Hate Crime Incidents By State since 1991

Even in 2020, California remains a leading state in hate crimes, with New York coming in a close second.
Click for interactive map: Total Number of Hate Crime Incidents By State in 2020

Because the California data spans many cities, we grouped cities by county and by county type (urban, suburban, or rural). To cross-reference cities to counties, we obtained the 2020_CA_city_to_county.csv (CA_c2c) dataset from World Population Review (World Population Review), which lists the county each city belongs to. The counties were then grouped into “urban”, “suburban”, and “rural” (CA State Association of Counties).
To build the California-only hate crime dataset, I subsetted the national FBI hate crime dataset by selecting rows with “CA” in the “state_abbr” column. I then cross-referenced each city against the CA_c2c dataset and added two columns, “county” and “county_cat”, giving the county name and county type for each city. Reported incidents that lacked a city or county name were removed.
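The subsetting and cross-referencing can be sketched in pandas as follows (the toy tables and the CA_c2c column names are assumptions):

```python
import pandas as pd

# Toy stand-ins for the national dataset and the city-to-county table
hc = pd.DataFrame({
    "state_abbr": ["CA", "CA", "NY"],
    "pub_agency_name": ["Sacramento", "Los Angeles", "Albany"],
})
ca_c2c = pd.DataFrame({
    "city": ["Sacramento", "Los Angeles"],
    "county": ["Sacramento", "Los Angeles"],
    "county_cat": ["urban", "urban"],
})

# Keep only California incidents
hc_ca = hc[hc["state_abbr"] == "CA"]

# Cross-reference city -> county name and county type; an inner join
# drops incidents whose city has no match, mirroring the removal of
# incidents with no reported city or county
hc_ca = hc_ca.merge(ca_c2c, left_on="pub_agency_name",
                    right_on="city", how="inner")
```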
To analyze hate crime in California, I asked questions similar to those at the U.S. level:
Following the national trend, the number of hate crimes in California increased until around 2001, and decreased until 2015 before picking back up.

Since 1991, the top three biases fueling hate crimes have been race/ethnicity, sexuality, and religion. Racially/ethnically motivated hate crimes are the most common across all years since 1991, though proportion-wise they occur at a lower rate than at the national level. Anti-race/ethnicity bias is also one of the only biases to show a significant increase around 2020, once again demonstrating it to be the largest driver of the number of reported hate crimes.
| Total Number of Hate Crimes By Motivation in CA Between 1991-2020 | Change in Total Number of Hate Crimes By Motivation in CA Between 1991-2020 |
|---|---|
| ![]() | ![]() |
As with hate crimes across the U.S., the top five racial motivators from highest to lowest are Anti-Black or African American, Anti-Hispanic or Latino, Anti-Asian, Anti-White, and Anti-Other Race/Ethnicity/Ancestry. One main difference is that hate crimes against Hispanics or Latinos are more common than the national-level patterns would suggest. This may be because California has one of the largest Hispanic and Latino populations (Statista 2019).
| Total Number of Hate Crimes By Racial Motivation in CA Between 1991-2020 | Change in Total Number of Hate Crimes By Racial Motivation in CA Between 1991-2020 |
|---|---|
| ![]() | ![]() |
From 2019 to 2020, hate crimes against all racial groups except Arabs and those of Multiple Races increased: hate crimes against Arabs decreased, while hate crimes against those of multiple races remained the same. The most noteworthy increase is in hate crimes against Asians, which rose by nearly 120 percent. One reason is the start of the COVID-19 pandemic in 2020, which drew a lot of negative attention and attitudes toward Asians and those of Asian descent. Since California has the largest Asian population of any state, an increase in hate crimes against Asians is not surprising (World Population Review 2022). News stories of attacks on Asians have dominated the news cycle in major California cities like Sacramento and Los Angeles (Mizes-Tan 2021, Cosgrove 2021).
Across urban, suburban, and rural counties, racially/ethnically motivated hate crimes dominate over all other biases, with hate crimes motivated by sexuality and religion following as the next most common.



Since urban counties have higher populations, they have the highest number of hate crimes from 1991-2020. Interestingly, only hate crimes in urban counties follow the trend found at the national and state level, such that there is an increase in hate crime until 2001 and then a decrease until around 2014 before increasing again. This trend is not found in suburban or rural counties. As a result, hate crimes in urban counties have the greatest impact on the trend seen at the state level.

Totaling the number of hate crimes from 1991 to 2020, most counties in California have comparable levels of hate crimes, but Los Angeles and San Diego Counties stand out with much higher levels of hate crimes motivated by race/ethnicity. This may be attributed to their large populations (California Demographics 2020).
Click for interactive map: Total Number of Hate Crime Incidents in CA since 1991

Even in 2020, this pattern still holds, with Los Angeles County and San Diego County having higher levels of racially motivated hate crimes.
Click for interactive map: Total Number of Hate Crime Incidents in CA in 2020

To analyze hate crimes in Sacramento, we use the Sacramento Police Department's incident reports between 2017 and 2021, which can be found on their Crime and Statistics page (Sacramento Crime Data). Since the individual incident reports are downloadable as separate Excel files, we combined them into one large file and converted it to CSV format to use as our dataset. The data includes the time and day each hate crime ("case") occurred, the location of the crime, the territory where the crime occurred ("beat"), and the type of bias that motivated it. By exploring hate crimes in Sacramento, we set out to see how biases drive hate crimes at the city level during a pivotal period that includes the Trump presidency and the COVID-19 pandemic.
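Combining the downloaded spreadsheets can be sketched as follows (the filename pattern is a placeholder for the actual downloaded files):

```python
import glob

import pandas as pd

def combine_reports(frames):
    """Stack the per-period incident report tables into one dataset."""
    return pd.concat(frames, ignore_index=True)

# Each downloaded Excel file becomes one DataFrame; the glob pattern
# below is a placeholder for the real filenames
files = sorted(glob.glob("sacpd_hate_crimes_*.xlsx"))
if files:
    sac = combine_reports([pd.read_excel(f) for f in files])
    sac.to_csv("sacramento_hate_crimes.csv", index=False)
```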
To clean the data, we converted the reported hate crime times from 24-hour military time to 12-hour time with AM/PM specifications to make them more reader-friendly. We lowercased and condensed most of the headers to one-word titles (e.g., 'Case Number' became 'case') to make these variables easier to work with during our exploration. We also lowercased the different biases for a standardized format, because some were written in all uppercase letters while others mixed lowercase and uppercase. Lastly, we consolidated repetitive biases under one name; for example, we relabeled all "anti-islam" and "anti-muslim" biases as "anti-islamic (muslim)." As with the national dataset, we assigned each bias to a bias category to standardize our analysis.
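A sketch of these cleaning steps on toy rows (the original header names and time format are assumptions about the raw spreadsheets):

```python
import pandas as pd

# Toy rows standing in for the combined SacPD dataset
sac = pd.DataFrame({
    "Case Number": ["21-1001", "21-1002"],
    "Bias": ["ANTI-MUSLIM", "Anti-Black"],
    "Time": ["1330", "0815"],
})

# Condense headers to lowercase one-word names
sac = sac.rename(columns={"Case Number": "case", "Bias": "bias", "Time": "time"})

# Standardize bias casing and consolidate repetitive labels
sac["bias"] = sac["bias"].str.lower().replace(
    {"anti-islam": "anti-islamic (muslim)",
     "anti-muslim": "anti-islamic (muslim)"})

# Convert 24-hour military time to a 12-hour AM/PM format
sac["time"] = pd.to_datetime(sac["time"], format="%H%M").dt.strftime("%I:%M %p")
```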
We explored the following questions to analyze Sacramento hate crimes:
Sacramento's total number of hate crimes per year from 2017 to 2021 shows a general upward trend, with the exception of 2018-2019, and a steep rise from 2020 to 2021. This could be explained by the COVID-19 pandemic, which began in 2020 and brought both nationwide and city-wide uncertainty and unrest.

Taking a closer look at the rise in hate crimes between 2020 and 2021, race and ethnicity appear to be the dominant motivators, with LGBT-motivated hate crimes also contributing a fair number during this period. The increase in LGBT hate crimes in 2018 could have been influenced by the major anti-LGBT policies passed by the Trump administration in 2017 (e.g., the trans military ban and the reversal of transgender student guidance (Simmons-Duffin 2020)), which contributed to anti-LGBT sentiments on a city scale. A Time article also suggests that anti-LGBT sentiment during the pandemic may stem from a growing number of anti-trans bills and laws that have been passed and enacted (Carlisle 2021).
The trend in the number of hate crimes remained relatively stable for the remaining bias categories.
| Total Number of Hate Crimes By Motivation in Sacramento Between 2017-2021 | Sacramento Hate Crime Count by Bias Category From 2017-2021 |
|---|---|
| ![]() | ![]() |
While the previous graph showed that LGBT-motivated hate crimes have been rising since 2020, their proportion relative to the other bias categories is still low (approximately 0.2). In contrast, the proportion of racially/ethnically motivated hate crimes spiked drastically in 2019 and stabilized at approximately 0.7 from 2020 to 2021.

The number of hate crimes motivated by anti-black bias remained the most consistent over time, as racial and ethnic hate crimes against African Americans are an ongoing problem. Between 2016 and 2018, there were no majorly publicized police brutality cases; however, we bring attention to the highly publicized police killings of Stephon Clark in 2018, George Floyd in 2020, and Daunte Wright in 2021 (BBC 2021). Each of these murders at the hands of police led to city-wide protests as part of the Black Lives Matter movement and to anti-black sentiment on a city scale. While there was a drop in anti-black hate crimes in 2020, there has since been a major rise.
Hate crimes motivated by anti-other race/ethnicity/national origin bias tell another story: their number skyrocketed in 2021, the highest within the 2017-2021 period. This visualization also addresses whether anti-asian/pacific islander hate crimes increased between 2020 and 2021, showing that the number of hate crimes motivated by this bias rose strongly in 2021 relative to other biases. While the number of anti-asian hate crimes was not as large as that of anti-other race/ethnicity/national origin crimes (which could still include anti-asian crimes where the victim's race was not identified in the crime report), it was still prevalent, roughly matching the number of anti-black hate crimes in 2018.
As the stacked barplot indicates, anti-other race/ethnicity/national origin bias experienced a major spike between 2020 and 2021 compared to previous years and other biases. The plot also reaffirms that anti-black hate crimes showed minimal fluctuation over time.
| Total Number of Racially-motivated Hate Crimes in Sacramento Between 2017-2021 | Change in Number of Hate Crimes By Different Racial Motivation in Sacramento Between 2017-2021 |
|---|---|
| ![]() | ![]() |
We also wanted to see whether hate crimes were concentrated in certain areas of Sacramento. To do so, we plotted coordinates for each beat in Sacramento (the territory or district in which a crime occurred), obtained through the SACOG Open Data Portal API (SACOG Data Portal). While we were unable to plot exact locations from the addresses in our dataset due to Google API complications, plotting hate crimes by beat approximates where they occurred, which is what we are interested in.
After plotting the hate crimes by beat, we found that the largest number of hate crimes occurred in Central (Beats 3A and 3B) and Southwestern (Beat 4A) Sacramento, with the majority being racially/ethnically motivated. These areas are near downtown Sacramento, and as noted in the California analysis, hate crimes typically occur most often in more populated regions.
Click for interactive map: Approximate Location of Hate Crimes By Beat, Categorized By Bias from 2017-2021

Map color legend by bias category:
* blue: "race/ethn"
* red: "lgbt"
* green: "religion"
* yellow: "sex"
* orange: "disability"
* brown: "other" (mixed categories)
Throughout recent years, social media has been a way for people to express their beliefs and interact with others. With the increase in hate crimes, especially in the past five years, social media has served as an avenue for people to simultaneously engage in hateful activities and raise awareness of them. In this portion of the analysis, we will be using Reddit's API to answer the following questions:
To access the data for this portion of the analysis, we created a Reddit account and installed PRAW (the Python Reddit API Wrapper), which allows us to scrape comments and posts from various communities on Reddit, known as subreddits. For this analysis, we primarily use the Sacramento and California subreddits. To analyze the language contained within these posts, we also apply Natural Language Processing (NLP) techniques using the Pandas and NLTK libraries.
Before delving into the posts related to hate speech, it is useful to review the subreddits' policies on hate speech. Using Reddit's API, we can access the rules of the California and Sacramento subreddits, as shown below.
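With PRAW, a subreddit's sidebar text (which contains the rules shown below) is available via its `description` attribute; `extract_rules` is a small illustrative helper of our own for isolating the rules section:

```python
def extract_rules(sidebar_md: str) -> str:
    """Return the portion of a sidebar markdown string from the
    'Community Rules' heading onward (empty string if absent)."""
    idx = sidebar_md.find("Community Rules")
    return sidebar_md[idx:] if idx != -1 else ""

# With an authenticated `reddit` instance:
#   print(extract_rules(reddit.subreddit("Sacramento").description))
```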
_______________________________________________
_______________________________________________
#**Community Rules:**
**1. All posts must be relevant to the Sacramento region**
We like our news local! Posts must be specific to the Sacramento region. Any posts not related to the Sacramento region will be deleted.
**2. Please SEARCH before posting**
Especially when it's about "Where are the best..." kind of links. We have a lot of great advice out there already - make the most of it!
**3. No fundraising**
We're all about supporting each other, but spammy (non-special event focused) and fundraising (Kickstarter, GoFundMe) posts and comments will be removed and spammers will be banned. Posts about upcoming charitable events are allowed.
**4. No selling**
Please use Craigslist, StubHub, Facebook Marketplace etc. for this purpose.
**5. No incivility, personal attacks or hate speech**
Racism, homophobic content and other forms of bigotry are not permitted. Such comments as well as those that threaten or advocate violence or death onto others will result in a ban.
Please comment with civility and do not personally attack others. Spirited debates are great, but if you have to resort to name calling, insults, or personal attacks, you've already lost. Such behavior will result in content removal at a minimum and a ban for repeat offenders.
**6. No spam or advertising**
Users who only use reddit to dump links to their website or ads for their business will be banned. Posts or comments that consist of self-promotion of goods or services will be deleted. Promoting local public events, however, is allowed.
Please see Reddit's [Spam](https://www.reddithelp.com/hc/en-us/articles/360043504051-What-constitutes-spam-Am-I-a-spammer-) and [Self Promotion](https://www.reddit.com/wiki/selfpromotion) guidelines if you are unclear on what is and isn't acceptable.
**7. Direct info on criminals/missing persons to police**
Direct people with information to verifiable law enforcement phone numbers or contact information only. Any missing person posts that show contact information other than law enforcement will be removed.
**8. No repetitive or low-quality content**
These include: low effort "shitposts", recent reposts, vague/no-context titles, obvious agenda posts or comments, Facebook rants, personal ads, repetitive content, etc. This rule will be applied per moderator discretion.
**9. Please follow Reddiquette and the User Agreement**
These rules here on this sub are in addition to [Reddiquette](https://www.reddithelp.com/hc/en-us/articles/205926439) and the [user agreement](https://www.redditinc.com/policies/user-agreement). Accordingly, posting personal information, harassment, and other breaches are strictly forbidden and will result in a ban.
**10. Wash your hands regularly!**
_______________________________________________
----
## **Welcome to r/California!** ##
**This subreddit is for news and information specifically about California of general interest to folks all across the state.**
----
## **Posting Rules** ##
* **All posts must be primarily about California.**
* Basic sub rule: **Be civil.**
* Follow basic reddiquette and reddit site rules.
* **NO** insults or incivility, trolling, bigotry, profanity or hate. Nothing that's rude, vulgar or offensive. Nothing gross or disgusting.
* **NO** doxxing — this includes looking up a user's history.
* No spam or reposts.
* No memes or image macros.
* [... additional formatting, sourcing, and topic rules elided ...]
----
As we can see from both subreddits, hate speech is banned. From Sacramento's subreddit rules: "Racism, homophobic content and other forms of bigotry are not permitted. Such comments as well as those that threaten or advocate violence or death onto others will result in a ban.
Please comment with civility and do not personally attack others. Spirited debates are great, but if you have to resort to name calling, insults, or personal attacks, you've already lost. Such behavior will result in content removal at a minimum and a ban for repeat offenders."
Similarly, from California's subreddit rules: "* NO insults or incivility, trolling, bigotry, profanity or hate. Nothing that's rude, vulgar or offensive."
Thus, we can see that hate speech is explicitly banned on both subreddits. From this baseline, we can analyze whether hate speech still occurs and how people respond to it.
Now, focusing on Sacramento's subreddit, we first use the Reddit API to scrape its top 10 posts and analyze what portion of them is linked to hate crime. One important caveat is that the result depends on the day of retrieval: the top 10 posts used in this analysis are from March 9.
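The scraping step can be sketched as follows with PRAW. The `is_hate_related` keyword check and its keyword list are our illustrative assumptions, not the exact criteria used in this analysis (which classified posts by reading them):

```python
# Hypothetical keyword check for whether a post title looks hate-crime related.
# The keyword list is an assumption for illustration only.
HATE_KEYWORDS = ["racial", "racist", "hate crime", "slur", "heckling"]

def is_hate_related(title):
    """Return True if the title mentions any hate-crime-related keyword."""
    t = title.lower()
    return any(k in t for k in HATE_KEYWORDS)

def fetch_top_titles(reddit, subreddit_name, limit=10):
    """Fetch titles of the current top posts via PRAW (requires API credentials)."""
    return [post.title for post in reddit.subreddit(subreddit_name).top(limit=limit)]

# Example classification on stand-in titles:
titles = [
    "Investigation Underway Into Racially Charged Heckling at El Dorado Hills Soccer Game",
    "Best tacos in Midtown?",
]
flags = [is_hate_related(t) for t in titles]
print(flags)  # [True, False]
```

Because the top posts change daily, rerunning `fetch_top_titles` on a different date would likely flag a different subset.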
As we can see from the top 10 posts on the Sacramento subreddit, most are not linked to hate crimes. However, two stand out: the fourth and the seventh, which express shock at or relate to a racial hate crime. That leaves about 80% of the top 10 posts on the Sacramento subreddit unlinked to conflict or hate crimes.
In order to analyze people's responses to a hate crime incident, we can delve further into the comments of the post titled "Investigation Underway Into Racially Charged Heckling at El Dorado Hills Soccer Game."
# Obtain the comments from the post
submission = reddit.submission(url="https://www.reddit.com/r/Sacramento/comments/tadt1s/investigation_underway_into_racially_charged/")
corpus1 = []
submission.comments.replace_more(limit=0)
for comment in submission.comments.list():
    corpus1.append(comment.body)
    print(comment.body)
Not shocking , still disgusting nonetheless. Of course it was EDH. Racists? in EDH? I don't believe it. Glad to be see they have identified the student. What type of punishment do you feel would be appropriate for the student/school? I think it would’ve been best to stop the game until the person(s) were pointed out and removed from the stadium. Is EDH known to be a racist place? "We cant be racists! We go to Rockin Sushi Nights!" I believe the student has lost his spot and scholarship to his chosen college. Ruined his own life. The kids were all protesting at the school yesterday, asking for accountability. Agreed. I don’t understand why no one said anything. Where are the parents? did EDC win? If so, not anymore. Their bad sportsmanship should be an automatic forfeit. >Is EDH known to be a racist place? [Who's to say really](https://www.thedailybeast.com/wealthy-california-neighborhood-called-cops-on-busload-of-riotersthey-were-actually-black-entrepreneurs) If I base my answer on my family that lives there, it’s not even possible that they’re racist since they have a friend who is <insert race here>. Oh and they *didn’t vote* for T (but are fine with all the other Rs because *not D*). /s for the ones who need it And this isn’t the first time their athletes have heckled their opponents over race. I've lived in the suburbs or exurbs east of Sacramento for many years and I would never paint the entire area with a broad brush, 44% of people in EDC voted Dem last election for example, BUT it's impossible to ignore the fact that there's a faction of folks who moved to the area because it's more white than other parts of the city. I'd say that a lot of the cities East and Northeast of Sac struggle with white supremecy. Absolutely Yes What’s Rockin sushi nights? Complicity is guilt in this cake, I believe I’ve played a lot of sports and officiated sports. That’s never happened. 
Even when a fan charged the field and physically attacked an official, they’ve never penalized the team. Fucking Folson... I'm guessing Daily Beast needs an editor. >Is EDH known to be a racist place? [No one knows for certain](https://www.kcra.com/article/dad-racial-slurs-cheered-during-mcclatchy-hs-oak-ridge-hs-game/6427576) It's a sushi place owned by Sha-Na-Na with Mexican cooks. Hey! Who put guilt in my cake?! The team is apart of their school body, the school body who did nothing to address the racist in their bleachers, are they not? Maybe they should start. Wasn't even folsom Poor sportsmanship on the part of the players is something where the school can more easily be held accountable. It’s much harder to hold parents, grandparents and other spectators to the same standard because it’s harder to positively connect a particular fan to one team or another. If it was not a student then the person who attended should have been permanently banned from all future events but truthfully I find it disgustingly hard to believe the investigation should take very long with out the abusers being protected. There really shouldn't even have to be one. Almost Everyone on those bleachers that night knows who it was and they all make me sick.
From initially looking at the comments, we can see that a lot of people are asking questions or expressing their opinion on the topic.
To understand the connotations behind the words used, and whether there is a pattern behind them, we are going to be performing NLP, specifically Frequency Analysis.
The Frequency Analysis shows that there are about 617 words in the comments. Although this technique loses information about word order, which is a disadvantage, we can see that most words do not repeat. However, there are words linked to hate crimes and violence, such as "abusers," "shocking," and "disgusting." Yet, as the comments shown earlier make clear, in this instance these words express disdain and shock at the incident rather than spreading hate.
Something to note is that a raw frequency count also picks up words that are simply common in English (stop words). We can also use one-hot encoding, which records only whether a word appears, not how often.
# Join the scraped comments (corpus1) into one string, then tokenize into words
comments = " ".join(corpus1)
words = nltk.word_tokenize(comments)
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(tokenizer = nltk.word_tokenize)
freq = vec.fit_transform(words)
vec.get_feature_names()  # get_feature_names_out() on newer scikit-learn
['!',
'%',
"''",
"'d",
"'m",
"'s",
"'ve",
'(',
')',
'*',
',',
'.',
'...',
'//www.kcra.com/article/dad-racial-slurs-cheered-during-mcclatchy-hs-oak-ridge-hs-game/6427576',
'//www.thedailybeast.com/wealthy-california-neighborhood-called-cops-on-busload-of-riotersthey-were-actually-black-entrepreneurs',
'/s',
'44',
':',
'<',
'>',
'?',
'[',
']',
'``',
'a',
'absolutely',
'abusers',
'accountability',
'accountable',
'address',
'agreed',
'all',
'almost',
'an',
'and',
'another',
'answer',
'anymore',
'anything',
'apart',
'appropriate',
'are',
'area',
'asking',
'at',
'athletes',
'attacked',
'attended',
'automatic',
'bad',
'banned',
'base',
'be',
'beast',
'because',
'been',
'being',
'believe',
'best',
'bleachers',
'body',
'broad',
'brush',
'but',
'by',
'cake',
'can',
'cant',
'certain',
'charged',
'chosen',
'cities',
'city',
'college',
'complicity',
'connect',
'cooks',
'course',
'd',
'daily',
'dem',
'did',
'didn’t',
'disgusting',
'disgustingly',
'do',
'don’t',
'easily',
'east',
'edc',
'edh',
'editor',
'election',
'entire',
'even',
'events',
'everyone',
'example',
'exurbs',
'fact',
'faction',
'family',
'fan',
'feel',
'field',
'find',
'fine',
'first',
'folks',
'folsom',
'folson',
'for',
'forfeit',
'friend',
'from',
'fucking',
'future',
'game',
'glad',
'go',
'grandparents',
'guessing',
'guilt',
'happened',
'hard',
'harder',
'has',
'have',
'heckled',
'held',
'here',
'hey',
'his',
'hold',
'https',
'i',
'identified',
'if',
'ignore',
'impossible',
'in',
'insert',
'investigation',
'is',
'isn’t',
'it',
'it’s',
'i’ve',
'kids',
'known',
'knows',
'last',
'life',
'lived',
'lives',
'long',
'lost',
'lot',
'make',
'many',
'maybe',
'me',
'mexican',
'more',
'moved',
'much',
'my',
"n't",
'need',
'needs',
'never',
'night',
'nights',
'no',
'nonetheless',
'northeast',
'not',
'nothing',
'of',
'official',
'officiated',
'oh',
'on',
'one',
'ones',
'opponents',
'or',
'other',
'out',
'over',
'own',
'owned',
'paint',
'parents',
'part',
'particular',
'parts',
'penalized',
'people',
'permanently',
'person',
'physically',
'place',
'played',
'players',
'pointed',
'poor',
'positively',
'possible',
'protected',
'protesting',
'punishment',
'put',
'race',
'racist',
'racists',
'really',
'removed',
'rockin',
'rs',
'ruined',
's',
'sac',
'sacramento',
'said',
'same',
'say',
'scholarship',
'school',
'see',
'sha-na-na',
'shocking',
'should',
'sick',
'since',
'so',
'something',
'spectators',
'sports',
'sportsmanship',
'spot',
'stadium',
'standard',
'start',
'still',
'stop',
'struggle',
'student',
'student/school',
'suburbs',
'supremecy',
'sushi',
't',
'take',
'team',
'than',
'that',
'that’s',
'the',
'their',
'then',
'there',
'they',
'they’re',
'they’ve',
'think',
'this',
'those',
'time',
'to',
'truthfully',
'type',
'understand',
'until',
'very',
'vote',
'voted',
'was',
'we',
'were',
'what',
'what’s',
'when',
'where',
'white',
'who',
'why',
'win',
'with',
'would',
'would’ve',
'years',
'yes',
'yesterday',
'you']
# One-hot encode the frequency matrix (1 if a word appears, 0 otherwise)
from sklearn.preprocessing import Binarizer
binarizer = Binarizer()
ohot = binarizer.fit_transform(freq)
ohot.todense()
matrix([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]])
The output above lists the individual words that appear in the comments, and the matrix shows their one-hot encoding (1 if a word is present, 0 otherwise). From this output, we can see that few words repeat across the comments.
Therefore, from this analysis of Sacramento's social media using Reddit's API and NLP, there does not seem to be much hateful language in use. Some words carry negative connotations, but the comments show they are used mainly to respond with shock to the incident, not to spread hate. Moreover, only 20% of the top posts were linked to violence or hate crime in the first place.
We can now compare this to the subreddit for the State of California.
Like we did for Sacramento's subreddit, we can get the top 10 most popular posts for California's subreddit.
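The share of top posts linked to hate crime can be computed from the scraped titles as sketched below; the `hate_related_share` helper, its keyword list, and the stand-in titles are assumptions for illustration, not the classification actually used above:

```python
# Hypothetical helper: fraction of titles flagged as hate-crime related
# by a simple keyword match (keyword list is an illustrative assumption).
def hate_related_share(titles, keywords=("racist", "racial", "hate crime")):
    flagged = [t for t in titles if any(k in t.lower() for k in keywords)]
    return len(flagged) / len(titles)

# Stand-in titles: one Article 34 post plus nine unrelated posts
sample_titles = ["California legislators agree: repeal a racist, classist provision"]
sample_titles += ["Unrelated post %d" % i for i in range(9)]
print(hate_related_share(sample_titles))  # 0.1
```

On these stand-ins, one post in ten is flagged, i.e. a 10% share, mirroring how the 20% figure for Sacramento was derived from its top 10 posts.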
Similar to Sacramento, not many of the posts are linked to hate crimes or violence. Out of all of them, the one that seems to be the most linked to racial issues is titled: "California legislators are in agreement: It’s time for the state to repeal a racist, classist provision in the state Constitution that makes it harder to build affordable housing. — Article 34 requires that cities get voter approval before they build “low-rent housing” funded with public dollars."
We can now delve into the comments of the post.
Do cities even build anything themselves anymore? Public housing is a fed thing isnt it? Don't think this is a limiting factor in housing shortage these days. Bypassing the paywall: https://12ft.io/proxy?q=https%3A%2F%2Fwww.latimes.com%2Fopinion%2Fstory%2F2022-03-14%2Feditorial-racist-california-article-34-public-affordable-housing There are multiple root cause of the problems. The main one is that housing has many conflicting purposes. For some, it's a place to live. For others, it's a way to build wealth while having a place to live, and for others, it's purely an investment, where all that matters is rate of return. Add to that the fact that many people are paid really poorly to work in expensive areas and it appears to be a nearly impossible problem to solve. And no, "low-rent housing" is not the answer Umm, people should get a say where their tax dollars go to… Yep. The solutions for each group are somewhat mutually exclusive and each demographic is too large, so there isnt a polically workable solution. One group wants cheap available housing, the others don't. They do! The state budget is proposed by the Governor (who is elected), passed by the Legislature (two houses, each Californian has two representatives - both of them elected!). Repealing this article will require it to be on the ballot. Which.. guess what, is something people can vote on! You should definitely be cc'd on every government expense form and email then, and have a veto over any you disagree with. I'm sure that will be an excellent use of your time. > It’s a remnant of an era that California should repudiate. A real estate industry group drafted the original initiative to require voter approval for public housing in 1950 — right after the federal Housing Act of 1949 banned explicit racial segregation in public housing. 
The initiative was framed as a way for residents to preserve “local control.” But although it was cynically wrapped in the guise of grass-roots democracy, giving voters the right to veto public housing was really just a sneaky way to let the mostly white voters bar low-income and minority residents from their communities. And bigots get to bigot!
Just from scraping the comments of that post, we can see that most are expressing their opinions on the issue. There do not seem to be many instances of words linked to hate crime, but we can explore this further using NLP.
From performing Frequency Analysis on the comments of the post, we can see which words are used frequently and whether any of them seem to be linked to hate crimes.
From performing Frequency Analysis, we can see that there are more words with negative connotations than in the comments from Sacramento's subreddit, such as "sneaky." Additionally, there seem to be more words linked to race, such as "white" and "minority." We can explore this further by obtaining the individual words from the comments.
# Read the saved California comments from a local text file
with open("/Users/shraddhajhingan/Desktop/cali_comments.txt") as f:
    cali_comments = f.read()
words = nltk.word_tokenize(cali_comments)
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(tokenizer = nltk.word_tokenize)
freq = vec.fit_transform(words)
vec.get_feature_names()  # get_feature_names_out() on newer scikit-learn
['!',
'%',
"''",
"'s",
'(',
')',
',',
'-',
'.',
'..',
'//12ft.io/proxy',
'1949',
'1950',
'2f',
'2f2022-03-14',
'2feditorial-racist-california-article-34-public-affordable-housing',
'2fopinion',
'2fstory',
'2fwww.latimes.com',
'3a',
':',
'>',
'?',
'``',
'a',
'act',
'add',
'after',
'all',
'although',
'an',
'and',
'answer',
'appears',
'approval',
'are',
'areas',
'article',
'as',
'ballot',
'banned',
'bar',
'be',
'bigot',
'bigots',
'both',
'budget',
'build',
'but',
'by',
'bypassing',
'california',
'californian',
'can',
'cause',
'communities',
'conflicting',
'control.”',
'cynically',
'democracy',
'do',
'dollars',
'drafted',
'each',
'elected',
'era',
'estate',
'expensive',
'explicit',
'fact',
'federal',
'for',
'framed',
'from',
'get',
'giving',
'go',
'governor',
'grass-roots',
'group',
'guess',
'guise',
'has',
'having',
'houses',
'housing',
'https',
'impossible',
'in',
'industry',
'initiative',
'investment',
'is',
'it',
'it’s',
'just',
'legislature',
'let',
'live',
'low-income',
'low-rent',
'main',
'many',
'matters',
'minority',
'mostly',
'multiple',
'nearly',
'no',
'not',
'of',
'on',
'one',
'original',
'others',
'paid',
'passed',
'paywall',
'people',
'place',
'poorly',
'preserve',
'problem',
'problems',
'proposed',
'public',
'purely',
'purposes',
'q=https',
'racial',
'rate',
'real',
'really',
'remnant',
'repealing',
'representatives',
'repudiate',
'require',
'residents',
'return',
'right',
'root',
'say',
'segregation',
'should',
'sneaky',
'solve',
'some',
'something',
'state',
'tax',
'that',
'the',
'their',
'them',
'there',
'they',
'this',
'to',
'to…',
'two',
'umm',
'veto',
'vote',
'voter',
'voters',
'was',
'way',
'wealth',
'what',
'where',
'which',
'while',
'white',
'who',
'will',
'work',
'wrapped',
'—',
'“local']
The words above seem more politically or racially charged than those from the comments on Sacramento's subreddit. From this and the previous analyses, we can see that while California's subreddit contains fewer posts that are racially or conflict-linked than Sacramento's, the words used in California's subreddit are more politically and racially affiliated. However, in both cases, such language is used to express disdain about the topic or to ask questions, not to commit a hate crime.
Lastly, we can take a look at the most popular post from the past year on California's subreddit. The top post is titled "California Defies Doom With No. 1 U.S. Economy" and has 903 comments. To analyze it, we used NLP to tokenize the comments into words and created a dataframe in which each row is an individual word.
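A minimal sketch of that tokenize-and-count step, using a short stand-in string in place of the 903 comments (the analysis itself tokenized with NLTK; `str.split` is used here as a simplified tokenizer):

```python
import pandas as pd
from collections import Counter

# Short stand-in for the joined comment text
text = "california economy grows california economy leads the nation"
tokens = text.split()  # simplified tokenization

# Dataframe with one word per row, as described above
word_df = pd.DataFrame({"word": tokens})

# Count adjacent word pairs (bigrams), the quantity plotted for this post
bigram_counts = Counter(zip(tokens, tokens[1:]))
print(bigram_counts.most_common(1))  # [(('california', 'economy'), 2)]
```

The most common bigrams from such counts are what the bigram plot for this post visualizes.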
The plot above shows the most common bigrams from the comments of the post. As we can see, none of them are linked to hate crimes or contain instances of violence.
We can also perform Frequency Analysis on the comments of this post, in order to investigate whether any of the words seem to repeat themselves.
Similar to the other posts from the State of California and Sacramento subreddits, not many words repeat here, and there do not appear to be many words with negative connotations. Interestingly, there are more numbers, such as dates and prices, suggesting that discussion of the economy stays within economic topics rather than drifting toward race. This indicates that racial biases do not play much of a role in discussions of economic issues on California's subreddit.
We can also see what the most popular words in the comments are.
From the wordcloud above, we can see that the majority of the comments are neutral and do not have any racial or other hate speech linked negative connotations. The most common words seem to be "state", "California", "like" and "people." Thus even on a post that discusses economic issues, people still tend to use neutral language.
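A wordcloud is drawn from word frequencies; below is a minimal sketch of the underlying counts on a stand-in string (the `wordcloud` package's generator then renders such counts visually, which is not reproduced here):

```python
from collections import Counter

# Stand-in text echoing the most common comment words noted above
text = "state california like people state california state"

# The frequency table a wordcloud is built from: bigger count -> bigger word
counts = Counter(text.split())
print(counts.most_common(2))  # [('state', 3), ('california', 2)]
```

The highest-count words ("state", "California") would appear largest in the rendered cloud.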
Looking at the national and state level, we find that changes in the number of hate crimes motivated by race/ethnicity bias consistently drive the changes in the overall number of hate crimes. Higher numbers of hate crimes at the state and county levels appear to be associated with larger populations. The recent uptick in hate crimes motivated by race/ethnicity bias coincides with the COVID-19 pandemic; accordingly, we found an increase in hate crimes targeting Asians, driven by negative sentiment and attitudes regarding COVID-19's origin in China.
Now understanding what motivates hate crime, we looked towards social media to see how people respond to hate crime. Ultimately, from the analysis of social media using Reddit's API and Natural Language Processing for the State of California and Sacramento, we can draw the following conclusions:
# Import library and packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import folium
import json
# Read in hate_crime.csv dataset
hc = pd.read_csv("./../datasets/hate_crime.csv")
# make header names lowercase so easier to work with
hc.columns = [col.lower() for col in hc.columns]
# Rename data_year col to "year"
hc.rename(columns = {"data_year":"year"}, inplace = True)
# Remove rows with missing values about offender's race
hc = hc.dropna(subset = ["offender_race"])
# Remove unnecessary columns
hc = hc.drop(["ori", "pub_agency_unit"], axis = 1)
# Fix Nebraska abbreviation ("NB" -> official "NE") in the state column
hc["state_abbr"] = hc["state_abbr"].replace("NB", "NE")
# Create bias categories: race/ethnicity (race/ethn), "sex", "lgbt", "religion", "disability", "other"
bias_cat_dic = {"race/ethn": ['Anti-Black or African American', 'Anti-White','Anti-Arab',
'Anti-Asian', 'Anti-Hispanic or Latino','Anti-Other Race/Ethnicity/Ancestry',
'Anti-Multiple Races, Group', 'Anti-American Indian or Alaska Native',
'Anti-Native Hawaiian or Other Pacific Islander',
'Anti-Eastern Orthodox (Russian, Greek, Other)'],
"sex": ['Anti-Female', 'Anti-Male'],
"lgbt": ['Anti-Gay (Male)','Anti-Heterosexual','Anti-Lesbian (Female)',
'Anti-Lesbian, Gay, Bisexual, or Transgender (Mixed Group)','Anti-Bisexual',
'Anti-Gender Non-Conforming','Anti-Transgender',],
"religion":['Anti-Jewish','Anti-Protestant','Anti-Other Religion','Anti-Islamic (Muslim)',
'Anti-Catholic','Anti-Multiple Religions, Group','Anti-Atheism/Agnosticism',
"Anti-Jehovah's Witness",'Anti-Mormon','Anti-Buddhist','Anti-Sikh',
'Anti-Other Christian','Anti-Hindu',],
"disability":['Anti-Physical Disability','Anti-Mental Disability'],
"other":["Unknown (offender's motivation not known)"]}
# ID which bias category each hate crime incident belongs to and add to dataframe
bias_cat = []
for bias in hc["bias_desc"]:
    if ";" not in bias:
        for key in bias_cat_dic:
            bias_lst = bias_cat_dic[key]
            if bias in bias_lst:
                bias_cat.append(key)
                break
    elif ";" in bias:
        # for incidents with more than one bias (these bias_cat begin with "mix")
        sub_biases = bias.split(";")
        temp_cat = "mix"
        for sub_bias in sub_biases:
            for key in bias_cat_dic:
                bias_lst = bias_cat_dic[key]
                if sub_bias in bias_lst:
                    temp_cat = temp_cat + "," + key
                    break
        bias_cat.append(temp_cat)
# Add bias category to dataframe
hc["bias_cat"] = bias_cat
# Remove rows with more than one bias
hc = hc[~hc["bias_cat"].str.contains("mix")]
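As an aside, the categorization loop above could also be written as a vectorized lookup. Below is a sketch with a small stand-in dictionary and dataframe in place of the full `bias_cat_dic` and `hc`:

```python
import pandas as pd

# Small stand-in for the full bias_cat_dic
bias_cat_dic = {"race/ethn": ["Anti-Asian", "Anti-Black or African American"],
                "religion": ["Anti-Jewish"]}

# Invert the dict into a bias -> category lookup
bias_lookup = {bias: cat for cat, biases in bias_cat_dic.items() for bias in biases}

demo = pd.DataFrame({"bias_desc": ["Anti-Asian", "Anti-Jewish",
                                   "Anti-Asian;Anti-Jewish"]})
# Drop multi-bias incidents (those containing ";"), as in the analysis
demo = demo[~demo["bias_desc"].str.contains(";")].copy()
# Map each remaining bias to its category in one vectorized step
demo["bias_cat"] = demo["bias_desc"].map(bias_lookup)
print(demo["bias_cat"].tolist())  # ['race/ethn', 'religion']
```

Dropping the multi-bias rows before mapping removes the need for the "mix" bookkeeping entirely.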
# Import county to city key
CA_c2c = pd.read_csv("../datasets/2020_CA_city_to_county.csv")
# Label which county is urban, suburban, rural
county_cat_dic = {"urban":["Alameda", "Contra Costa", "Fresno", "Los Angeles", "Orange", "Riverside", "Sacramento",
"San Bernardino", "San Diego", "San Francisco", "San Joaquin", "San Mateo", "Santa Clara", "Ventura"],
"suburban":["Butte", "Imperial", "Kern", "Marin", "Merced", "Monterey", "Napa", "Placer", "San Luis Obispo",
"Santa Barbara", "Santa Cruz", "Shasta", "Solano", "Sonoma", "Stanislaus", "Tulare", "Yolo"],
"rural":["Alpine", "Amador", "Calaveras", "Colusa", "Del Norte", "El Dorado", "Glenn", "Humboldt",
"Inyo", "Kings", "Lake", "Lassen", "Madera", "Mariposa", "Mendocino", "Modoc", "Mono", "Nevada", "Plumas",
"San Benito", "Sierra", "Siskiyou", "Sutter", "Tehama","Trinity","Tuolumne","Yuba"]}
# Add county category to 2020_CA_city_to_county.csv
county_cat = []
for county in CA_c2c["county"]:
    if county in county_cat_dic["urban"]:
        county_cat.append("urban")
    elif county in county_cat_dic["suburban"]:
        county_cat.append("suburban")
    elif county in county_cat_dic["rural"]:
        county_cat.append("rural")
    else:
        county_cat.append(None)
# Add county category to CA_c2c
CA_c2c["county_cat"] = county_cat
# Make a dataframe with just California data (copy to avoid SettingWithCopyWarning)
CA_hc = hc[hc["state_abbr"] == "CA"].copy()
# Annotating whether each county is urban, suburban, or rural
county_cat = []
counties = []
city_names = list(CA_c2c["city"])
for city in CA_hc["pub_agency_name"]:
    city = city.strip()
    if city in city_names:
        county_cat.append(list(CA_c2c[CA_c2c["city"] == city]["county_cat"])[0])
        counties.append(list(CA_c2c[CA_c2c["city"] == city]["county"])[0])
    elif city in county_cat_dic["urban"]:
        county_cat.append("urban")
        counties.append(city)
    elif city in county_cat_dic["suburban"]:
        county_cat.append("suburban")
        counties.append(city)
    elif city in county_cat_dic["rural"]:
        county_cat.append("rural")
        counties.append(city)
    else:
        county_cat.append(None)
        counties.append(None)
# Adding county and county_cat as columns to the CA hc dataframe
CA_hc["county"] = counties
CA_hc["county_cat"] = county_cat
# Remove rows with no county annotation
CA_hc = CA_hc[~CA_hc["county_cat"].isna()]
# Group number of hate crime incidents per year
num_hc = pd.DataFrame(hc.groupby("year")["incident_id"].count())
num_hc.reset_index(inplace = True)
num_hc.columns = ["year","num_incidents"]
# Plot reported hate crime incidents from 1991-2020
num_hc_plot = sns.lineplot(data = num_hc, x = "year", y = "num_incidents")
num_hc_plot.set(xlabel = "Year", ylabel = "Number of Hate Crime Incidents",
title = "Change in Total Number of Hate Crimes\n in the U.S. Between 1991-2020")
# Calculate num of hate crime by bias category each year
num_bias_hc = pd.DataFrame(hc.groupby(["year","bias_cat"])["incident_id"].count())
num_bias_hc.reset_index(inplace = True)
num_bias_hc.columns = ["year", "bias_cat", "num_bias_cat"]
# Rearrange dataframe to make stacked bar charts
def transpose_df(df, loop_lst, biastype, colname):
    """
    Makes bias category rows into columns to allow for stacked bar charts
    df = pandas hate crime dataframe
    loop_lst = list of category values to loop over
    biastype = name of the category column (e.g. "bias_cat" or "bias_desc")
    colname = column name of interest
    """
    t_dic = {}
    for bias in loop_lst:
        bias_rows = list(df[df[biastype] == bias][colname])
        # pad with zeros so every category spans all 30 years (1991-2020)
        while len(bias_rows) < 30:
            bias_rows = [0] + bias_rows
        t_dic[bias] = bias_rows
    t_df = pd.DataFrame(t_dic, index = range(1991, 2021))
    return t_df
# rearrange columns of num_bias_hc in order to make bar plot
num_bias_hc_t_df = transpose_df(num_bias_hc,bias_cat_dic.keys(), "bias_cat", "num_bias_cat")
# Bar plot of number of reported hate crimes by bias category over the years
num_bias_hc_plt = num_bias_hc_t_df.plot(kind='bar', stacked=True)
num_bias_hc_plt.set(xlabel = "Year", ylabel = "Number of Hate Crime Incidents",
title = "Total Number of Hate Crimes By Motivation\n in the U.S. Between 1991-2020")
plt.legend(bbox_to_anchor = [1, 1])
# Line plot of number of reported hate crimes by bias category over the years
num_bias_hc_plt = sns.lineplot(data = num_bias_hc[["year", "num_bias_cat", "bias_cat"]], x = "year",
y = "num_bias_cat", hue = "bias_cat")
num_bias_hc_plt.set(xlabel = "Year", ylabel = "Number of Hate Crime Incidents",
title = "Change in Total Number of Hate Crimes By Motivation\n in the U.S. Between 1991-2020")
plt.legend(bbox_to_anchor = [1, 1])
# Calc num of hate crime by each specific racial bias over years
racebias_hc = pd.DataFrame(hc[hc["bias_cat"] == "race/ethn"].groupby(["year", "bias_desc"])["incident_id"].count())
racebias_hc.reset_index(inplace = True)
racebias_hc.columns = ["year", "bias_desc", "num_race_incident"]
# rearrange columns of racebias_hc in order to make bar plot
num_racebias_hc_t_df = transpose_df(racebias_hc, bias_cat_dic["race/ethn"], "bias_desc", "num_race_incident")
# Bar plot of number of reported hate crimes by race/ethn category over the years
num_racebias_hc_plt = num_racebias_hc_t_df.plot(kind='bar', stacked=True)
num_racebias_hc_plt.set(xlabel = "Year", ylabel = "Number of Hate Crime Incidents",
title = "Total Number of Hate Crimes By Racial Motivation\n in the U.S. Between 1991-2020")
plt.legend(bbox_to_anchor = [1, 1])
# Take top 5 races with most hate crime incidents to plot a line plot
top_race = racebias_hc.groupby("bias_desc").sum().sort_values("num_race_incident", ascending = False)
top_5_race = list(top_race.iloc[0:5].index)
top_5_race_hc = racebias_hc[racebias_hc["bias_desc"].isin(top_5_race)]
# Line plot of number of reported hate crimes by top 5 race/ethn category over the years
racebias_hc_plt = sns.lineplot(data = top_5_race_hc[["year", "num_race_incident", "bias_desc"]],
x = "year", y = "num_race_incident", hue = "bias_desc")
racebias_hc_plt.legend(bbox_to_anchor=[1, 1])
racebias_hc_plt.set(xlabel = "Year", ylabel = "Number of Hate Crime Incidents",
title = "Change in Number of Hate Crimes By Different Racial Motivation\n in the U.S. Between 1991-2020")
# Calculate the % difference between the num of hate crime in 2019 v 2020
race_20_hc = racebias_hc[racebias_hc["year"] >= 2019]
racebias_dif = []
for race in bias_cat_dic["race/ethn"]:
    race_hc = race_20_hc[race_20_hc["bias_desc"] == race]
    num20 = list(race_hc[race_hc["year"] == 2020]["num_race_incident"])[0]
    num19 = list(race_hc[race_hc["year"] == 2019]["num_race_incident"])[0]
    dif = round(((num20 - num19) / num19) * 100, 3)
    racebias_dif.append([race, dif])
racebias_dif = pd.DataFrame(racebias_dif, columns = ["race/ethn", "percent_dif"]).sort_values("percent_dif", ascending = False)
# Bar plot of percent dif in num of hate crime incidents by race/ethn
racebias_dif_plt = racebias_dif.plot(kind = "barh", y = "percent_dif",
x = "race/ethn",color=(racebias_dif["percent_dif"] > 0).map({True: "g",
False: "r"}))
racebias_dif_plt.set(xlabel = "Percent Difference of Number of Hate Crime Incidents from 2019-2020",
ylabel = "Race/Ethnicity",
title = "Percent Difference in Number of Hate Crime Incidents\n By Racial Motivators from 2019-2020 in U.S.")
# total num incidence since 1991 of each bias category of each state
state_hc = pd.DataFrame(hc.groupby(["state_abbr", "bias_cat"])["incident_id"].count())
state_hc.reset_index(inplace = True)
state_hc.columns = ["state_name", "bias_cat", "num_bias_incident"]
# Calculate total num of hate crime by race/ethn per state in US since 1991
states = [state[0] for state in hc.groupby("state_abbr")["state_abbr"]]
state_bias = []
num_incident =[]
for state in states:
    state_df = state_hc[state_hc["state_name"] == state]
    num_incident = list(state_df[state_df["bias_cat"] == "race/ethn"]["num_bias_incident"])[0]
    state_bias.append([state, num_incident])
race_state_bias = pd.DataFrame(state_bias, columns = ["state", "num_incident"])
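As an aside, the per-state loop above can be expressed as a single pandas filter-and-rename. Below is a sketch on stand-in data in place of the full `state_hc` table:

```python
import pandas as pd

# Stand-in for state_hc: per-state counts by bias category
state_hc = pd.DataFrame({"state_name": ["CA", "CA", "NY"],
                         "bias_cat": ["race/ethn", "religion", "race/ethn"],
                         "num_bias_incident": [100, 40, 80]})

# Keep only race/ethn rows and rename columns for the choropleth
race_state_bias = (state_hc[state_hc["bias_cat"] == "race/ethn"]
                   .rename(columns={"state_name": "state",
                                    "num_bias_incident": "num_incident"})
                   [["state", "num_incident"]]
                   .reset_index(drop=True))
print(race_state_bias)
```

This avoids the indexed list lookups and produces the same two-column dataframe the map needs.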
# Map of total number of hatecrime by race/ethncity bias per state in the US since 1991
# Based on official documentation: https://python-visualization.github.io/folium/quickstart.html
url = (
"https://raw.githubusercontent.com/python-visualization/folium/master/examples/data"
)
state_geo = f"{url}/us-states.json"
racebias_map = folium.Map(location=[40, -95], zoom_start=3.5)
fig = folium.Figure(width = 800, height = 450)
fig.add_child(racebias_map)
folium.Choropleth(
geo_data=state_geo,
name="choropleth",
data=race_state_bias,
columns=["state", "num_incident"],
key_on="feature.id",
fill_color="YlGn",
fill_opacity=0.7,
line_opacity=0.2,
legend_name="Total Number of Hate Crime Incidents since 1991",
).add_to(racebias_map)
folium.LayerControl().add_to(racebias_map)
racebias_map
# total number of hatecrime by race/ethncity bias per state in the US in 2020
state_hc_20 = pd.DataFrame(hc[hc["year"] == 2020].groupby(["state_abbr", "bias_cat"])["incident_id"].count())
state_hc_20.reset_index(inplace = True)
state_hc_20.columns = ["state_name", "bias_cat", "num_bias_incident"]
state_hc_20 = state_hc_20[~(state_hc_20["state_name"] == "FS")]
state_hc_20 = state_hc_20[(state_hc_20["bias_cat"] == "race/ethn")]
# Map of total number of hate crimes by race/ethnicity bias per state in the US in 2020
# Based on official documentation: https://python-visualization.github.io/folium/quickstart.html
racebias_20_map = folium.Map(location=[40, -95], zoom_start=3.5)
fig = folium.Figure(width = 800, height = 450)
fig.add_child(racebias_20_map)
folium.Choropleth(
geo_data=state_geo,
name="choropleth",
data=state_hc_20,
columns=["state_name", "num_bias_incident"],
key_on="feature.id",
fill_color="YlGn",
fill_opacity=0.7,
line_opacity=0.2,
legend_name="Number of Hate Crime Incidents in 2020",
).add_to(racebias_20_map)
folium.LayerControl().add_to(racebias_20_map)
racebias_20_map
# Group number of hate crime incidents per year in CA
CA_num_hc = pd.DataFrame(CA_hc.groupby("year")["incident_id"].count())
CA_num_hc.reset_index(inplace = True)
CA_num_hc.columns = ["year","num_incidents"]
# Plot reported hate crime incidents from 1991-2020 in CA
CA_num_hc_plot = sns.lineplot(data = CA_num_hc, x = "year", y = "num_incidents")
CA_num_hc_plot.set(xlabel = "Year", ylabel = "Number of Hate Crime Incidents",
title = "Change in Total Number of Hate Crimes\n in CA Between 1991-2020")
# Calc num of hate crime incidents by bias motivation per year in CA
CA_num_bias_hc = pd.DataFrame(CA_hc.groupby(["year","bias_cat"])["incident_id"].count())
CA_num_bias_hc.reset_index(inplace = True)
CA_num_bias_hc.columns = ["year", "bias_cat", "num_incident"]
# rearrange columns of num_bias_hc in order to make bar plot
CA_num_bias_hc_t_df = transpose_df(CA_num_bias_hc,bias_cat_dic.keys(), "bias_cat", "num_incident")
# Bar plot of number of reported hate crimes by bias category over the years in CA
CA_num_bias_hc_plt = CA_num_bias_hc_t_df.plot(kind='bar', stacked=True)
CA_num_bias_hc_plt.set(xlabel = "Year", ylabel = "Number of Hate Crime Incidents",
title = "Total Number of Hate Crimes By Motivation\n in CA Between 1991-2020")
plt.legend(bbox_to_anchor = [1, 1])
# Line plot reported hate crime incidents by different bias motivation from 1991-2020 in CA
CA_num_bias_hc_plt = sns.lineplot(data = CA_num_bias_hc[["year","bias_cat", "num_incident"]], x = "year",
y = "num_incident", hue = "bias_cat")
CA_num_bias_hc_plt.legend(bbox_to_anchor=[1, 1])
CA_num_bias_hc_plt.set(xlabel = "Year", ylabel = "Number of Hate Crime Incidents",
title = "Change in Number of Hate Crimes By Motivation\n in CA Between 1991-2020")
# Calc num of hate crime by racial bias over years in CA
CA_racebias_hc = pd.DataFrame(CA_hc[CA_hc["bias_cat"] == "race/ethn"].groupby(["year", "bias_desc"])["incident_id"].count())
CA_racebias_hc.reset_index(inplace = True)
CA_racebias_hc.columns = ["year", "bias_desc", "num_race_incident"]
# rearrange columns of num_racebias_hc in order to make bar plot
CA_racebias_hc_t_df = transpose_df(CA_racebias_hc, bias_cat_dic["race/ethn"], "bias_desc", "num_race_incident")
# Bar plot of number of reported hate crimes by race/ethn category over the years
CA_num_racebias_hc_plt = CA_racebias_hc_t_df.plot(kind='bar', stacked=True)
CA_num_racebias_hc_plt.set(xlabel = "Year", ylabel = "Number of Hate Crime Incidents",
title = "Total Number of Hate Crimes By Racial Motivation\n in CA Between 1991-2020")
plt.legend(bbox_to_anchor = [1, 1])
# Take top 5 races with most hate crime incidents
top_race = CA_racebias_hc.groupby("bias_desc").sum().sort_values("num_race_incident", ascending = False)
top_5_race = list(top_race.iloc[0:5].index)
top_5_race_hc = CA_racebias_hc[CA_racebias_hc["bias_desc"].isin(top_5_race)]
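The groupby/sort/slice pattern above can be written more compactly with pandas' `nlargest`; a minimal sketch with toy values (column names mirror `CA_racebias_hc`, the counts are made up):

```python
import pandas as pd

# toy stand-in for CA_racebias_hc with invented counts
toy = pd.DataFrame({
    "bias_desc": ["anti-black", "anti-asian", "anti-white"],
    "num_race_incident": [120, 80, 40],
})

# sum incidents per bias, then keep the 2 largest in one step
top = toy.groupby("bias_desc")["num_race_incident"].sum().nlargest(2)
print(list(top.index))  # ['anti-black', 'anti-asian']
```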
# Line plot of number of reported hate crimes by top 5 race/ethn category over the years in CA
CA_racebias_hc_plt = sns.lineplot(data = top_5_race_hc[["year", "num_race_incident", "bias_desc"]],
x = "year", y = "num_race_incident", hue = "bias_desc")
CA_racebias_hc_plt.legend(bbox_to_anchor=[1, 1])
CA_racebias_hc_plt.set(xlabel = "Year", ylabel = "Number of Hate Crime Incidents",
title = "Change in Number of Hate Crimes By Different Racial Motivation\n in CA Between 1991-2020")
# Calculate the % difference in num of hate crimes by racial motivation in 2019 v 2020
CA_race_20_hc = CA_racebias_hc[CA_racebias_hc["year"] >= 2019]
CA_racebias_dif = []
for race in bias_cat_dic["race/ethn"]:
if len(CA_race_20_hc[CA_race_20_hc["bias_desc"] == race]) == 2:
CA_race_hc = CA_race_20_hc[CA_race_20_hc["bias_desc"] == race]
CA_num20 = list(CA_race_hc[CA_race_hc["year"] == 2020]["num_race_incident"])[0]
CA_num19 = list(CA_race_hc[CA_race_hc["year"] == 2019]["num_race_incident"])[0]
dif = round(((CA_num20-CA_num19)/CA_num19) * 100,3)
CA_racebias_dif.append([race, dif])
CA_racebias_dif = pd.DataFrame(CA_racebias_dif, columns = ["race/ethn", "percent_dif"]).sort_values("percent_dif",
ascending = False)
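The 2019-vs-2020 loop above can also be vectorized with a pivot, avoiding the per-race list indexing; a sketch on toy counts (the real frame is `CA_racebias_hc`, these numbers are invented):

```python
import pandas as pd

# toy stand-in for the 2019/2020 slice of CA_racebias_hc
toy = pd.DataFrame({
    "year": [2019, 2020, 2019, 2020],
    "bias_desc": ["anti-black", "anti-black", "anti-asian", "anti-asian"],
    "num_race_incident": [100, 150, 40, 80],
})

# one row per bias, one column per year, then a single vectorized formula
wide = toy.pivot(index="bias_desc", columns="year", values="num_race_incident")
wide["percent_dif"] = ((wide[2020] - wide[2019]) / wide[2019] * 100).round(3)
print(wide["percent_dif"].sort_values(ascending=False))
```

Biases with a count in only one of the two years come out as `NaN` here, which plays the same role as the `len(...) == 2` check in the loop.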
# Bar plot of percent dif in num of hate crime incidents by race/ethn bias 2019 v 2020
CA_racebias_dif_plt = CA_racebias_dif.plot(kind = "barh", y = "percent_dif", x = "race/ethn",
color=(CA_racebias_dif["percent_dif"] > 0).map({True: "g",
False: "r"}))
CA_racebias_dif_plt.set(xlabel = "Percent Difference of Number of Hate Crime Incidents from 2019-2020",
ylabel = "Race/Ethnicity",
title = "Percent Difference in Number of Hate Crime Incidents\n By Racial Motivators from 2019-2020 in CA")
# Calc num of hate crimes by bias category each year by county
CA_num_bias_hc = pd.DataFrame(CA_hc.groupby(["county_cat","year","bias_cat"])["incident_id"].count())
CA_num_bias_hc.reset_index(inplace = True)
CA_num_bias_hc.columns = ["county_cat","year","bias_cat", "num_bias_cat"]
# CA URBAN counties: Bar Plot total num of hate crime by racial motivation 1991-2020
urban_num_bias_hc_t_df = transpose_df(CA_num_bias_hc[CA_num_bias_hc["county_cat"] == "urban"],bias_cat_dic.keys(), "bias_cat", "num_bias_cat")
urban_num_bias_hc_plt = urban_num_bias_hc_t_df.plot(kind='bar', stacked=True)
urban_num_bias_hc_plt.set(xlabel = "Year", ylabel = "Number of Hate Crime Incidents",
title = "Total Number of Hate Crimes By Racial Motivation\n in Urban CA Counties Between 1991-2020")
plt.legend(bbox_to_anchor = [1, 1])
# CA SUBURBAN counties: Bar Plot total num of hate crime by racial motivation 1991-2020
suburban_num_bias_hc_t_df = transpose_df(CA_num_bias_hc[CA_num_bias_hc["county_cat"] == "suburban"],bias_cat_dic.keys(), "bias_cat", "num_bias_cat")
suburban_num_bias_hc_plt = suburban_num_bias_hc_t_df.plot(kind='bar', stacked=True)
suburban_num_bias_hc_plt.set(xlabel = "Year", ylabel = "Number of Hate Crime Incidents",
title = "Total Number of Hate Crimes By Racial Motivation\n in Suburban CA Counties Between 1991-2020")
plt.legend(bbox_to_anchor = [1, 1])
# CA RURAL counties: Bar Plot total num of hate crime by racial motivation 1991-2020
rural_num_bias_hc_t_df = transpose_df(CA_num_bias_hc[CA_num_bias_hc["county_cat"] == "rural"],bias_cat_dic.keys(), "bias_cat", "num_bias_cat")
rural_num_bias_hc_plt = rural_num_bias_hc_t_df.plot(kind='bar', stacked=True)
rural_num_bias_hc_plt.set(xlabel = "Year", ylabel = "Number of Hate Crime Incidents",
title = "Total Number of Hate Crimes By Racial Motivation\n in Rural CA Counties Between 1991-2020")
plt.legend(bbox_to_anchor = [1, 1])
# Calc num of hate crime by race/ethn motivation for each county over years in CA
CA_county_racebias_hc = pd.DataFrame(CA_hc[CA_hc["bias_cat"] == "race/ethn"].groupby(["year", "county_cat"])["incident_id"].count())
CA_county_racebias_hc.reset_index(inplace = True)
CA_county_racebias_hc.columns = ["year", "county_cat", "num_race_incident"]
# Line plot of number of reported hate crimes by race/ethn category over the years in CA
CA_county_racebias_hc_plt = sns.lineplot(data = CA_county_racebias_hc,
x = "year", y = "num_race_incident", hue = "county_cat")
CA_county_racebias_hc_plt.legend(bbox_to_anchor=[1, 1])
CA_county_racebias_hc_plt.set(xlabel = "Year", ylabel = "Number of Hate Crime Incidents",
title = "Change in Number of Racially Motivated Hate Crimes By County\n in CA Between 1991-2020")
# Calc total num of hate crime since 1991 by each bias category of each county
CA_county_hc = pd.DataFrame(CA_hc.groupby(["county", "bias_cat"])["incident_id"].count())
CA_county_hc.reset_index(inplace = True)
CA_county_hc.columns = ["county", "bias_cat", "num_bias_incident"]
# Obtain list of all counties in CA_county_hc dataset
counties = list(CA_county_hc["county"].unique())
# Calculate total number of hate crime by race/ethn bias by county since 1991
county_bias = []
for county in counties:
county_df = CA_county_hc[CA_county_hc["county"] == county]
num_incident = list(county_df[county_df["bias_cat"] == "race/ethn"]["num_bias_incident"])[0]
county_bias.append([county, num_incident])
CA_county_racebias = pd.DataFrame(county_bias, columns = ["county", "num_incident"])
# Map of total number of hate crimes by race/ethnicity bias per county in CA since 1991
# Json file from: https://github.com/python-visualization/folium/blob/main/tests/us-counties.json
county_geo = json.load(open("../datasets/CA_counties.json"))
CA_racebias_map = folium.Map(location=[37.5, -120], zoom_start=5.5)
fig = folium.Figure(width = 600, height = 600)
fig.add_child(CA_racebias_map)
folium.Choropleth(
geo_data=county_geo,
name="choropleth",
data=CA_county_racebias,
columns=["county", "num_incident"],
key_on="properties.name",
fill_color="YlGn",
fill_opacity=0.7,
line_opacity=0.2,
legend_name="Total Number of Racially Motivated Hate Crime Incidents in CA since 1991",
).add_to(CA_racebias_map)
folium.LayerControl().add_to(CA_racebias_map)
CA_racebias_map
# Calc total number of hate crimes by race/ethnicity bias per CA county in 2020
county_hc_20 = pd.DataFrame(CA_hc[CA_hc["year"] == 2020].groupby(["county", "bias_cat"])["incident_id"].count())
county_hc_20.reset_index(inplace = True)
county_hc_20.columns = ["county", "bias_cat", "num_bias_incident"]
county_hc_20 = county_hc_20[(county_hc_20["bias_cat"] == "race/ethn")]
# Map of total number of hate crimes by race/ethnicity bias per county in CA in 2020
# Json file from: https://github.com/python-visualization/folium/blob/main/tests/us-counties.json
CA_racebias_20_map = folium.Map(location=[37.5, -120], zoom_start=5.5)
fig = folium.Figure(width = 600, height = 600)
fig.add_child(CA_racebias_20_map)
folium.Choropleth(
geo_data=county_geo,
name="choropleth",
data=county_hc_20,
columns=["county", "num_bias_incident"],
key_on="properties.name",
fill_color="YlGn",
fill_opacity=0.7,
line_opacity=0.2,
legend_name="Total Number of Racially Motivated Hate Crime Incidents in CA in 2020",
).add_to(CA_racebias_20_map)
folium.LayerControl().add_to(CA_racebias_20_map)
#fig.save("./figures/CA_racebias_20.html")
CA_racebias_20_map
# Cleaning spd_incident dataset
import pandas as pd
spd_incident = pd.read_csv("spd_incident.csv")
#Renaming column names
spd_incident.rename(columns = {'Case Number':'case', 'Report Date':'date', 'Incident Time':'time',
'Location':'location', 'Beat':'beat', 'Bias Motivation': 'bias'}, inplace = True)
spd_incident.head()
#Making time column more reader-friendly
for i in range(23):
if spd_incident.iloc[i,2] != '0':
spd_incident.iloc[i,2] = pd.to_datetime(spd_incident.iloc[i,2], format="%H%M").strftime('%I:%M %p')
else:
spd_incident.iloc[i,2] = '12:00 AM' #Special case: If time is 0 military time, then it's 12AM
for i in range(23,63):
if spd_incident.iloc[i,2] != '0':
spd_incident.iloc[i,2] = pd.to_datetime(spd_incident.iloc[i,2]).strftime('%I:%M %p')
else:
spd_incident.iloc[i,2] = '12:00 AM' #Special case: If time is 0 military time, then it's 12AM
for i in range(63,len(spd_incident)):
if spd_incident.iloc[i,2] == '0': #Special case: If time is 0 military time, then it's 12AM
spd_incident.iloc[i,2] = '12:00 AM'
elif spd_incident.iloc[i,2] == '1': #Special case: If time is 1 military time, then it's 1AM
spd_incident.iloc[i,2] = '1:00 AM'
elif spd_incident.iloc[i,2] == '4': #Special case: If time is 4 military time, then it's 4AM
spd_incident.iloc[i,2] = '4:00 AM'
else:
spd_incident.iloc[i,2] = pd.to_datetime(spd_incident.iloc[i,2], format="%H%M").strftime('%I:%M %p')
spd_incident
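The three index-range loops above encode the same parsing rules; a single helper can cover all of the observed formats. This is a sketch under the assumption that the raw times are digit strings (bare hours like `'0'` or `'4'`, or military `'HMM'`/`'HHMM'` values like `'930'` or `'2230'`):

```python
import pandas as pd

def to_12h(raw):
    """Convert a raw military-time string to a 12-hour clock label (sketch)."""
    raw = str(raw).strip()
    if raw.isdigit():
        # 1-2 digits are bare hours ('4' -> 04:00); 3-4 digits are HMM/HHMM
        hhmm = raw.zfill(2) + "00" if len(raw) <= 2 else raw.zfill(4)
        return pd.to_datetime(hhmm, format="%H%M").strftime("%I:%M %p")
    # already-formatted times fall through to pandas' general parser
    return pd.to_datetime(raw).strftime("%I:%M %p")

print(to_12h("0"))     # 12:00 AM
print(to_12h("2230"))  # 10:30 PM
```

With a helper like this, the cleaning step can be a single `spd_incident["time"].apply(...)` instead of three hardcoded index ranges.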
#Making bias entries all lower case because some are all caps and some camel case
spd_incident["bias"] = spd_incident["bias"].str.lower()
spd_incident.head()
new_biases = []
#Condensing some repetitive biases (e.g.'anti-islamic' is the same as 'anti-muslim')
for bias in spd_incident["bias"]:
if 'anti-religion' in bias:
bias = 'anti-other religion'
elif 'national origin' in bias:
bias = 'anti-other race/ethnicity/national origin'
elif "muslim" in bias:
bias = 'anti-islamic (muslim)'
elif "islamic" in bias:
bias = "anti-islamic (muslim)"
elif "buddhism" in bias: #condensing 'anti-other religion (Buddhism, Shintoism, etc.)' to 'anti-other religion"
bias = "anti-other religion"
elif "religious" in bias:
bias = "anti-other religion"
elif 'ethnicity' in bias:
bias = 'anti-other race/ethnicity/national origin'
elif "mulit" in bias:
bias = 'anti-multi-racial group' #updating mulit to multi (typo)
elif 'jew/anti-catholic' in bias:
bias = 'anti-multi-religion group'
elif 'asian' in bias:
bias = 'anti-asian/pacific islander'
else:
bias = bias.rstrip() #getting rid of unnecessary white space after biases
new_biases.append(bias) #adding cleaned biases to new list
spd_incident["biases"] = new_biases
for i in range(len(spd_incident)):
spd_incident.iloc[i,1] = pd.to_datetime(spd_incident.iloc[i,1]).strftime('%m/%d/%Y')
#Previewing variable types
spd_incident.dtypes
# Check how many missing values there are in each column
na_col = []
print("There are", len(spd_incident), "rows in the spd_incident dataset")
for col in spd_incident.columns:
num_na = spd_incident[col].isna().sum()
print(col, "has", num_na, "missing values")
if num_na != 0:
na_col.append(col)
#Exporting clean_spd_incident df as csv
clean_spd_incident = spd_incident.drop(['bias'], axis=1)
clean_spd_incident.to_csv("clean_sac_incident.csv")
clean_spd_incident.head(3)
There are 277 rows in the spd_incident dataset
case has 0 missing values
date has 0 missing values
time has 0 missing values
location has 0 missing values
beat has 0 missing values
bias has 0 missing values
biases has 0 missing values
| | case | date | time | location | beat | biases |
|---|---|---|---|---|---|---|
| 0 | 17-29506 | 01/31/2017 | 09:00 AM | 5700 BLOCK OF BROADWAY | 6B | anti-black |
| 1 | 17-38423 | 02/08/2017 | 06:43 PM | 4000 BLOCK OF LA TARRIGA WAY | 5B | anti-hispanic |
| 2 | 17-41871 | 02/10/2017 | 10:30 PM | 3600 BLOCK OF RIVERSIDE BLVD | 4A | anti-islamic (muslim) |
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sac_data = pd.read_csv("clean_sac_incident.csv")
sac_data['year'] = pd.DatetimeIndex(sac_data['date']).year #Adding column for just year
annual_crime_count = pd.DataFrame(sac_data.groupby("year")["case"].count())
annual_crime_count.reset_index(inplace = True)
annual_crime_count
sns.lineplot(data = annual_crime_count, x = "year", y = "case").set(title='Sacramento annual crime count from 2017-2021',
ylabel="Number of cases") #Plotting annual number of crimes over time
scale = plt.xticks([2017,2018,2019,2020,2021]) #Re-scaling x-axis (year) range
#Storing all the types of biases in a list of biases
bias_ls = []
for bias in sac_data["biases"]:
if "\n" not in bias and bias not in bias_ls:
bias_ls.append(bias)
elif "\n" in bias:
# Take into account when multiple biases recorded
sub_biases = bias.split('\n') #multiple biases separated by spaces
for sub_bias in sub_biases:
sub_bias = sub_bias.rstrip()
if sub_bias not in bias_ls:
bias_ls.append(sub_bias)
# Create bias categories: race/ethnicity (race/ethn), "sex", "lgbt", "religion", "disability", "other"
bias_cat_dic = {"race/ethn": ['anti-black','anti-hispanic','anti-asian/pacific islander','anti-other race/ethnicity/national origin',
'anti-asian','anti-white','anti-multi-racial group','anti-arab'],
"sex": ['anti-reproductive rights'],
"lgbt": ['anti-transgender','anti-homosexual','anti-male homosexual (gay)','anti-female homosexual (lesbian)',
'anti-sexual orientation'],
"religion":['anti-islamic (muslim)','anti-jewish','anti-catholic','anti-multi-religion group',
'anti-other religion'],
"disability":['anti-physical disability','anti-disability'],
"other":['unknown', 'anti-multi bias']}
# ID what bias category each hate crime incidence is and add to dataframe
bias_cat = []
for bias in sac_data["biases"]:
if "\n" not in bias:
for key in bias_cat_dic:
bias_lst = bias_cat_dic[key]
if bias in bias_lst:
bias_cat.append(key)
break
else:
# for incidents with more than one biases (these bias_cat begins with "mix")
sub_biases = bias.split("\n")
temp_cat = "mix"
for sub_bias in sub_biases:
for key in bias_cat_dic:
bias_lst = bias_cat_dic[key]
if sub_bias in bias_lst:
temp_cat = temp_cat + "," + key
break
bias_cat.append(temp_cat)
# Add bias category to dataframe
sac_data["bias_cat"] = bias_cat
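The nested category-lookup loops above can be replaced by inverting `bias_cat_dic` once into a flat bias-to-category mapping; a minimal sketch with a trimmed-down dictionary (the real one has six categories):

```python
# trimmed-down stand-in for bias_cat_dic
bias_cat_dic = {
    "race/ethn": ["anti-black", "anti-asian"],
    "lgbt": ["anti-homosexual"],
}

# invert once: a flat {bias: category} dict replaces the inner loop
bias_to_cat = {b: cat for cat, biases in bias_cat_dic.items() for b in biases}
print(bias_to_cat["anti-black"])  # race/ethn
```

Single-bias rows then become a dictionary lookup (`bias_to_cat.get(bias, "other")`), and only the multi-bias `"mix"` case still needs a loop.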
# sac_data without mix bias
nomix_sac_data = sac_data[~sac_data["bias_cat"].str.contains("mix")].copy()
# Calculate num of hate crime by bias category each year
num_bias_sac = pd.DataFrame(nomix_sac_data.groupby(["year","bias_cat"])["case"].count())
num_bias_sac.reset_index(inplace = True)
num_bias_sac.columns = ["year", "bias_cat", "num_bias_cat"]
# Rearrange dataframe to make stacked bar charts
def transpose_df(df, loop_lst, biastype, colname):
"""
Makes bias category rows into columns to allow for stack bar charts
df = pandas hatecrime dataframe
biastype = bias_cat by default, and can be specified to
colname = column name of interest
"""
t_dic = {}
for bias in loop_lst:
bias_rows = list(df[df[biastype] == bias][colname])
while len(bias_rows) < 5:
bias_rows = [0] + bias_rows
t_dic[bias] = bias_rows
t_df = pd.DataFrame(t_dic, index = range(2017,2022))
return t_df
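`transpose_df` is essentially a pivot; the same reshaping can be done with pandas' built-in `DataFrame.pivot`, sketched here on toy data shaped like `num_bias_sac` (the counts are invented):

```python
import pandas as pd

# toy stand-in for num_bias_sac
toy = pd.DataFrame({
    "year": [2017, 2017, 2018],
    "bias_cat": ["lgbt", "race/ethn", "race/ethn"],
    "num_bias_cat": [3, 5, 7],
})

# one row per year, one column per bias; missing year/bias pairs become 0,
# mirroring transpose_df's zero-padding
t_df = (toy.pivot(index="year", columns="bias_cat", values="num_bias_cat")
           .reindex(range(2017, 2019))
           .fillna(0)
           .astype(int))
print(t_df)
```

The `reindex` plays the role of the `while len(bias_rows) < 5` padding loop, but works for any year range rather than a hardcoded one.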
# rearrange columns of num_bias_hc in order to make bar plot
num_bias_sac_t_df = transpose_df(num_bias_sac,bias_cat_dic.keys(), "bias_cat", "num_bias_cat")
# Bar plot of number of reported hate crimes by bias category over the years
num_bias_sac_plt = num_bias_sac_t_df.plot(kind='bar', stacked=True)
num_bias_sac_plt.set(xlabel = "Year", ylabel = "Number of Hate Crime Incidents",
title = "Total Number of Hate Crimes By Motivation\n in Sacramento Between 2017-2021")
plt.legend(bbox_to_anchor = [1, 1])
# Line plot of number of reported hate crimes by bias category over the years
# sac data without mix bias
nomix_sac_data = sac_data[~sac_data["bias_cat"].str.contains("mix")].copy()
# Create num bias category each year (no mix)
num_bias_nomix_sac_data = pd.DataFrame(nomix_sac_data.groupby(["year","bias_cat"])["case"].count())
num_bias_nomix_sac_data.reset_index(inplace = True)
num_bias_nomix_sac_data.columns = ["year", "bias_cat", "num_bias_cat"]
num_bias_nomix_sac_data["prop_bias"] = num_bias_nomix_sac_data["num_bias_cat"]/num_bias_nomix_sac_data.groupby("year")["num_bias_cat"].transform("sum")
sns.lineplot(data = num_bias_nomix_sac_data, x = "year", y = "num_bias_cat", hue = "bias_cat").set(title='Sacramento Hate Crime Count by Bias Category From 2017-2021')
scale = plt.xticks([2017,2018,2019,2020,2021]) #Re-scaling x-axis (year) range
plt.legend(bbox_to_anchor = [1, 1])
# Plot of each type of bias's bias proportion over time (without mix)
sns.lineplot(data = num_bias_nomix_sac_data, x = "year", y = "prop_bias", hue = "bias_cat")
scale = plt.xticks([2017,2018,2019,2020,2021]) #Re-scaling x-axis (year) range
# Calc num of hate crime by each specific racial bias over years
racebias_sac = pd.DataFrame(nomix_sac_data[nomix_sac_data["bias_cat"] == "race/ethn"].groupby(["year", "biases"])["case"].count())
racebias_sac.reset_index(inplace = True)
racebias_sac.columns = ["year", "biases", "num_race_incident"]
# rearrange columns of racebias_hc in order to make bar plot
num_racebias_sac_t_df = transpose_df(racebias_sac, bias_cat_dic["race/ethn"], "biases", "num_race_incident")
# Bar plot of number of reported hate crimes by race/ethn category over the years
num_racebias_sac_plt = num_racebias_sac_t_df.plot(kind='bar', stacked=True)
num_racebias_sac_plt.set(xlabel = "Year", ylabel = "Number of Hate Crime Incidents",
title = "Total Number of Racially-motivated Hate Crimes\n in Sacramento Between 2017-2021")
plt.legend(bbox_to_anchor = [1, 1])
# Take top 5 races with most hate crime incidents to plot a line plot
top_race = racebias_sac.groupby("biases").sum().sort_values("num_race_incident", ascending = False)
top_5_race = list(top_race.iloc[0:5].index)
top_5_race_sac = racebias_sac[racebias_sac["biases"].isin(top_5_race)]
# Line plot of number of reported hate crimes by top 5 race/ethn category over the years
racebias_sac_plt = sns.lineplot(data = top_5_race_sac[["year", "num_race_incident", "biases"]],
x = "year", y = "num_race_incident", hue = "biases")
racebias_sac_plt.legend(bbox_to_anchor=[1, 1])
racebias_sac_plt.set(xlabel = "Year", ylabel = "Number of Hate Crime Incidents",
title = "Change in Number of Hate Crimes By Different Racial Motivation\n in Sacramento Between 2017-2021")
scale = plt.xticks([2017,2018,2019,2020,2021])
#Counting # of hate crimes per type of bias in each bias category (without mix since so few)
group_counts = nomix_sac_data.groupby(['bias_cat','biases']).size().sort_values(ascending = False)
group_counts.to_frame(name = 'count').reset_index().drop_duplicates("bias_cat") #getting top counts of bias crimes for unique bias categories
| | bias_cat | biases | count |
|---|---|---|---|
| 0 | race/ethn | anti-other race/ethnicity/national origin | 91 |
| 2 | lgbt | anti-homosexual | 35 |
| 4 | religion | anti-other religion | 12 |
| 12 | other | anti-multi bias | 4 |
| 16 | disability | anti-disability | 1 |
| 21 | sex | anti-reproductive rights | 1 |
#Counting # of hate crimes per category of bias per year (without mix since so few)
group_counts_years = nomix_sac_data.groupby(['year','bias_cat']).size().sort_values(ascending = False)
group_counts_years.to_frame(name = 'count').reset_index().drop_duplicates("year").sort_values(by='year') #getting top counts of bias crimes for unique bias categories
| | year | bias_cat | count |
|---|---|---|---|
| 6 | 2017 | race/ethn | 13 |
| 3 | 2018 | lgbt | 18 |
| 5 | 2019 | lgbt | 14 |
| 1 | 2020 | race/ethn | 37 |
| 0 | 2021 | race/ethn | 88 |
#Counting # of hate crimes per bias per year (without mix since so few)
group_counts_years = nomix_sac_data.groupby(['year','biases']).size().sort_values(ascending = False)
group_counts_years.to_frame(name = 'count').reset_index().drop_duplicates("year").sort_values(by='year') #getting top counts of bias crimes for unique bias categories
| | year | biases | count |
|---|---|---|---|
| 6 | 2017 | anti-black | 9 |
| 2 | 2018 | anti-homosexual | 17 |
| 4 | 2019 | anti-homosexual | 13 |
| 3 | 2020 | anti-black | 16 |
| 0 | 2021 | anti-other race/ethnicity/national origin | 88 |
import requests
#Making request to get latitudes and longitudes for beats
url = 'https://opendata.arcgis.com/datasets/0d7615bf9b1e47948046a82b261d2384_0.geojson'
response = requests.get(url)
results = response.json()
#Adding beat coordinates to list
beats = []
coords = []
for i in range(0,len(sac_data['beat'].unique())):
beats.append(results['features'][i]['properties']['BEAT'])
coords.append(results['features'][i]['geometry']['coordinates'])
ls_coords = pd.DataFrame(coords,columns = ['coords']).reset_index() #list of list of coords
#Typo in the entry at row 91; '6d' instead of '6D'
nomix_sac_data.loc[91, 'beat'] = '6D'
ls_dict = {}
for i in range(0,len(sac_data['beat'].unique())):
ls_dict["beat{x}_coords".format(x=i)] = [item for item in ls_coords['coords'][i]]
#How to access a set of coords in that beat example:
#ls_dict["beat18_coords"][1]
#Assigning each beat a number (0 through number of beats - 1)
beats_dict = {}
for i in range(0,len(sac_data['beat'].unique())):
beats_dict[beats[i]] = i
#Assigning the beat number count to new column in df
nomix_sac_data["beat_num"] = nomix_sac_data["beat"].apply(lambda x: beats_dict.get(x))
#Assigning each beat in each row with a unique coordinate within the list of coords for that beat
#Delete that coordinate from the beat list each time; always take first coordinate
coord_col = []
for beat in nomix_sac_data["beat_num"]:
num = int(beat)
coord = ls_dict["beat{x}_coords".format(x=num)][0]
ls_dict["beat{x}_coords".format(x=num)].pop(0)
coord_col.append(coord)
#Removing the coordinate from the list of coordinates so won't be reused
#Creating latitude and longitude columns for the coordinates
nomix_sac_data["latitude"] = [coord[1] for coord in coord_col]
nomix_sac_data["longitude"] = [coord[0] for coord in coord_col]
nomix_sac_data.head(3)
| | Unnamed: 0 | case | date | time | location | beat | biases | year | bias_cat | beat_num | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 17-29506 | 01/31/2017 | 09:00 AM | 5700 BLOCK OF BROADWAY | 6B | anti-black | 2017 | race/ethn | 18 | 38.553226 | -121.436635 |
| 1 | 1 | 17-38423 | 02/08/2017 | 06:43 PM | 4000 BLOCK OF LA TARRIGA WAY | 5B | anti-hispanic | 2017 | race/ethn | 15 | 38.459592 | -121.446556 |
| 2 | 2 | 17-41871 | 02/10/2017 | 10:30 PM | 3600 BLOCK OF RIVERSIDE BLVD | 4A | anti-islamic (muslim) | 2017 | religion | 11 | 38.568844 | -121.512322 |
#Plotting points on folium map
import folium
# Make a map of Sacramento
m = folium.Map(location = [38.5816, -121.4944], zoom_start = 12) #Getting map of Sacramento
nomix_sac_data['coords'] = list(zip(nomix_sac_data['latitude'], nomix_sac_data["longitude"],nomix_sac_data.bias_cat))
def produce_color(x):
if x == 'lgbt':
return 'red'
elif x == 'race/ethn':
return 'blue'
elif x == 'religion':
return 'green'
elif x == 'sex':
return 'yellow'
elif x == 'disability':
return 'purple'
else:
return 'brown'
for coord in nomix_sac_data['coords']:
folium.Circle(location=[coord[0], coord[1]], color = produce_color(coord[2])).add_to(m)
m.save('mapit.html')
m
#downloading the needed libraries
import praw
import pandas as pd
import nltk
import nltk.corpus
#connecting to the API
reddit = praw.Reddit(client_id='[redacted]', client_secret='[redacted]', user_agent='[redacted]')
#since we're focusing on the Sacramento subreddit, I'm first going to get the description of the subreddit
sacramento = reddit.subreddit('Sacramento')
print(sacramento.description)
#getting top 10 popular posts from Sacramento's subreddit
corpus = []
popular=reddit.subreddit('Sacramento').hot(limit=10)
for posts in popular:
print(posts.title)
corpus.append(posts.title) # so that we can do NLP later
df = pd.DataFrame(corpus)
df
#getting comments
submission = reddit.submission(url="https://www.reddit.com/r/Sacramento/comments/tadt1s/investigation_underway_into_racially_charged/")
corpus1=[]
submission.comments.replace_more(limit=0)
for comment in submission.comments.list():
corpus1.append(comment.body)
print(comment.body)
#reading comments into a txt file and splitting into words
comments = nltk.corpus.gutenberg.raw("[file name]")
#splitting the comments into words
words = nltk.word_tokenize(comments)
#frequency analysis
def get_freq_doc(doc):
words = (w.lower() for w in nltk.word_tokenize(doc))
words = (w for w in words if w not in ["the", "a", "an"] and w.isalnum())
return nltk.FreqDist(words)
df = pd.DataFrame([get_freq_doc(doc) for doc in words])
df = df.fillna(0)
df
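`get_freq_doc` relies on `nltk.word_tokenize`; the core frequency idea can be checked without the tokenizer using `collections.Counter` (`str.split` stands in for the tokenizer here, so punctuation handling differs):

```python
from collections import Counter

def get_freq_simple(doc):
    """Lowercase, drop a few stop words and non-alphanumerics, count (sketch)."""
    words = (w.lower() for w in doc.split())
    words = (w for w in words if w not in ["the", "a", "an"] and w.isalnum())
    return Counter(words)

freq = get_freq_simple("The city council met and the council voted")
print(freq)  # 'council' appears twice; 'the' is filtered out
```

`Counter` behaves like `nltk.FreqDist` for lookups: missing words return 0 rather than raising.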
#one-hot encoding
words = nltk.word_tokenize(comments)
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(tokenizer = nltk.word_tokenize)
freq = vec.fit_transform(words)
vec.get_feature_names()
from sklearn.preprocessing import Binarizer
binarizer = Binarizer()
ohot = binarizer.fit_transform(freq)
ohot.todense()
#Now focusing on California's subreddit and getting most popular posts
lst = []
popular=reddit.subreddit('California').hot(limit=10)
for posts in popular:
print(posts.title)
lst.append(posts.title) # so that we can do NLP later
df = pd.DataFrame(lst)
df
#getting comments
submission = reddit.submission(url="https://www.reddit.com/r/California/comments/te5qjz/california_legislators_are_in_agreement_its_time/")
corpus1 = []
submission.comments.replace_more(limit=0)
for comment in submission.comments.list():
    corpus1.append(comment.body)
    print(comment.body)
#reading the saved California comments back in from a txt file (plain file read)
cali_comments = open("[file name]").read()
words = nltk.word_tokenize(cali_comments)
#Frequency Analysis
def get_freq_doc(doc):
    words = (w.lower() for w in nltk.word_tokenize(doc))
    words = (w for w in words if w not in ["the", "a", "an"] and w.isalnum())
    return nltk.FreqDist(words)

#as above, each token becomes its own one-word "document", so the result is one-hot
df = pd.DataFrame([get_freq_doc(doc) for doc in words])
df
|  | there | are | multiple | root | cause | of | problems | main | one | is | ... | sneaky | let | mostly | white | bar | minority | from | communities | bigots | bigot |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 307 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN |
| 308 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 309 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 310 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 |
| 311 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
312 rows × 150 columns
#top post: reading the saved top-post text back in (plain file read)
top_post = open("[file name]").read()
words_top = nltk.word_tokenize(top_post)
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(tokenizer=nltk.word_tokenize)
freq = vec.fit_transform(words_top)
x = vec.get_feature_names()
df = pd.DataFrame(x, columns=['Words'])
corpus = []
new = df['Words'].str.split()
new = new.values.tolist()
corpus = [word for i in new for word in i]
stop = ["the", "a", "an"]
counter = Counter(w for w in corpus if w not in stop)  # count words, skipping stop words
most = counter.most_common()
import seaborn as sns
def _get_top_ngram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(n, n)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx])
                  for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:10]

top_n_bigrams = _get_top_ngram(corpus, 2)  # already limited to the top 10
x, y = map(list, zip(*top_n_bigrams))
sns.barplot(x=y, y=x)
#source: https://neptune.ai/blog/exploratory-data-analysis-natural-language-processing-tools
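As a quick cross-check on the n-gram helper above, the same idea of counting adjacent word pairs can be reproduced with `nltk.bigrams` and `collections.Counter`; the token list below is a made-up stand-in for the scraped corpus.

```python
from collections import Counter
import nltk

# made-up tokens standing in for the comment corpus
tokens = ["stop", "hate", "stop", "hate", "support", "victims"]

# nltk.bigrams yields adjacent token pairs; Counter tallies how often each occurs
bigram_counts = Counter(nltk.bigrams(tokens))
print(bigram_counts.most_common(2))  # ('stop', 'hate') occurs twice
```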
#Frequency Analysis
def get_freq_doc(doc):
    words = (w.lower() for w in nltk.word_tokenize(doc))
    words = (w for w in words if w not in ["the", "a", "an"] and w.isalnum())
    return nltk.FreqDist(words)

#again applied per token, so each row holds a single-word count
df = pd.DataFrame([get_freq_doc(doc) for doc in words_top])
df